
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
| Labels | Description |
|---|---|
| ID | Customer ID |
| Age | Customer’s age in completed years |
| Experience | Years of professional experience |
| Income | Annual income of the customer (in thousand dollars) |
| ZIP Code | Home address ZIP code |
| Family | Family size of the customer |
| CCAvg | Average spending on credit cards per month (in thousand dollars) |
| Education | Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional |
| Mortgage | Value of house mortgage, if any (in thousand dollars) |
| Personal_Loan | Did this customer accept the personal loan offered in the last campaign? |
| Securities_Account | Does the customer have a securities account with the bank? |
| CD_Account | Does the customer have a certificate of deposit (CD) account with the bank? |
| Online | Does the customer use internet banking facilities? |
| CreditCard | Does the customer use a credit card issued by another bank (excluding AllLife Bank)? |
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as stats
from sklearn import metrics, tree
from sklearn.tree import DecisionTreeClassifier
# to split the data into train and test, and to tune hyperparameters
from sklearn.model_selection import train_test_split, GridSearchCV
import warnings
warnings.filterwarnings('ignore')
sns.set()
# Removes the limit from the number of displayed columns and rows.
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_rows', 200)
print('Loading Libraries Complete')
Loading Libraries Complete
df = pd.read_csv('Loan_modelling.csv')
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns')
There are 5000 rows and 14 columns
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
df.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
# Look to see if there are any duplicate values in the data
print(f'There are {df.duplicated().sum()} duplicate values in the data')
There are 0 duplicate values in the data
# Take a look at all of the columns
df.columns
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard'],
dtype='object')
# Convert all column names to lower case, change creditcard to credit_card, change ccavg to cc_avg, change zipcode to zip
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace('zipcode', 'zip')
df.columns = df.columns.str.replace('ccavg', 'cc_avg')
df.columns = df.columns.str.replace('creditcard', 'credit_card')
df.columns
Index(['id', 'age', 'experience', 'income', 'zip', 'family', 'cc_avg',
'education', 'mortgage', 'personal_loan', 'securities_account',
'cd_account', 'online', 'credit_card'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 5000 non-null int64 1 age 5000 non-null int64 2 experience 5000 non-null int64 3 income 5000 non-null int64 4 zip 5000 non-null int64 5 family 5000 non-null int64 6 cc_avg 5000 non-null float64 7 education 5000 non-null int64 8 mortgage 5000 non-null int64 9 personal_loan 5000 non-null int64 10 securities_account 5000 non-null int64 11 cd_account 5000 non-null int64 12 online 5000 non-null int64 13 credit_card 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
# Convert the family and education columns into Categorical columns
categorical_columns = ['family', 'education']
for feature in categorical_columns:
    df[feature] = pd.Categorical(df[feature])
# Look at the data again to see how it has changed
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 5000 non-null int64 1 age 5000 non-null int64 2 experience 5000 non-null int64 3 income 5000 non-null int64 4 zip 5000 non-null int64 5 family 5000 non-null category 6 cc_avg 5000 non-null float64 7 education 5000 non-null category 8 mortgage 5000 non-null int64 9 personal_loan 5000 non-null int64 10 securities_account 5000 non-null int64 11 cd_account 5000 non-null int64 12 online 5000 non-null int64 13 credit_card 5000 non-null int64 dtypes: category(2), float64(1), int64(11) memory usage: 479.0 KB
df.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 5000.0 | NaN | NaN | NaN | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| age | 5000.0 | NaN | NaN | NaN | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| experience | 5000.0 | NaN | NaN | NaN | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| income | 5000.0 | NaN | NaN | NaN | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| zip | 5000.0 | NaN | NaN | NaN | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| family | 5000.0 | 4.0 | 1.0 | 1472.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| cc_avg | 5000.0 | NaN | NaN | NaN | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| education | 5000.0 | 3.0 | 1.0 | 2096.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mortgage | 5000.0 | NaN | NaN | NaN | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| personal_loan | 5000.0 | NaN | NaN | NaN | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| securities_account | 5000.0 | NaN | NaN | NaN | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| cd_account | 5000.0 | NaN | NaN | NaN | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| online | 5000.0 | NaN | NaN | NaN | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| credit_card | 5000.0 | NaN | NaN | NaN | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
family has only 4 unique values and education only 3, so we will use one-hot encoding to break these two columns out into 5 dummy columns.
The following columns have a min of 0 and a max of 1, indicating binary flags (0 for false, 1 for true): personal_loan, securities_account, cd_account, online, and credit_card.
age has a mean of 45.3, a standard deviation of 11.46, a minimum of 23, and a maximum of 67. experience has a minimum of -3, which is impossible and is investigated below.
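The one-hot expansion described above can be sketched on a toy frame (toy data standing in for the bank dataset, so the sketch is self-contained):

```python
import pandas as pd

# Minimal sketch (toy data, not the bank dataset): one-hot encoding the two
# categorical features with drop_first=True yields (4-1) + (3-1) = 5 columns.
toy = pd.DataFrame({'family': [1, 2, 3, 4], 'education': [1, 2, 3, 1]})
toy[['family', 'education']] = toy[['family', 'education']].astype('category')
encoded = pd.get_dummies(toy, columns=['family', 'education'], drop_first=True)
print(encoded.columns.tolist())
# ['family_2', 'family_3', 'family_4', 'education_2', 'education_3']
```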
df.isnull().sum().sort_values(ascending=False)
id 0 age 0 experience 0 income 0 zip 0 family 0 cc_avg 0 education 0 mortgage 0 personal_loan 0 securities_account 0 cd_account 0 online 0 credit_card 0 dtype: int64
df.isnull().values.any()
False
outlier_df = df.select_dtypes(include=['int64', 'float64'])
outlier_df.skew()
id 0.000000 age -0.029341 experience -0.026325 income 0.841339 zip -0.296165 cc_avg 1.598443 mortgage 2.104002 personal_loan 2.743607 securities_account 2.588268 cd_account 3.691714 online -0.394785 credit_card 0.904589 dtype: float64
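income, cc_avg, and mortgage are clearly right-skewed. If a transform were ever wanted, np.log1p is one common way to pull in a long right tail; a sketch on a synthetic right-skewed sample (not applied to df above):

```python
import numpy as np
import pandas as pd

# Sketch only: np.log1p typically shrinks the skew of a right-skewed
# distribution.  Demonstrated on a synthetic lognormal sample, not on df.
rng = np.random.default_rng(1)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))
print(f'skew before: {skewed.skew():.2f}, after log1p: {np.log1p(skewed).skew():.2f}')
```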
# function to combine a boxplot with a histogram
def histogram_boxplot(feature, figsize=(15,7), bins=None):
    '''
    Boxplot and histogram combined
    feature: one-dimensional feature array
    figsize: size of the figure to be output (default (15,7))
    bins: number of bins (default None / auto)
    '''
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2,
                                           sharex=True,
                                           gridspec_kw={'height_ratios': (.25, .75)},
                                           figsize=figsize
                                           )
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color='yellow')
    # histplot (not the figure-level displot, which ignores ax=) so the
    # histogram lands on ax_hist2
    sns.histplot(x=feature, kde=True, ax=ax_hist2, bins=bins if bins else 'auto')
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--')
    ax_hist2.axvline(np.median(feature), color='blue', linestyle='-')
# function to identify outliers using the 1.5 * IQR rule
def create_outliers(feature: str, dataframe=df):
    '''
    Returns a dataframe of the rows where feature is an outlier
    feature: column name as a string
    dataframe: a pandas dataframe (default df)
    '''
    Q1 = dataframe[feature].quantile(0.25)
    Q3 = dataframe[feature].quantile(0.75)
    IQR = Q3 - Q1
    return dataframe[((dataframe[feature] < (Q1 - 1.5 * IQR)) | (dataframe[feature] > (Q3 + 1.5 * IQR)))]
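create_outliers only flags IQR outliers. If flooring and capping were ever applied as a treatment, a minimal sketch could look like this (cap_outliers is an illustrative helper, not used elsewhere in this notebook):

```python
import pandas as pd

# Illustrative helper (hypothetical, not part of the analysis): floor values
# below Q1 - 1.5*IQR and cap values above Q3 + 1.5*IQR using Series.clip.
def cap_outliers(series: pd.Series) -> pd.Series:
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(cap_outliers(pd.Series([1, 2, 3, 4, 100])).tolist())
```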
# draw a barplot with the percentage over the top of each bar
def percent_bar(ax, feature):
    '''
    Annotates a bar chart with a percentage on the top of each bar
    ax: the axes holding the bar chart
    feature: the categorical feature
    '''
    total = len(feature)
    for p in ax.patches:
        percentage = 100 * p.get_height() / total
        percentage_label = f'{percentage:.1f}%'
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        ax.annotate(percentage_label, (x, y), size=12)
# Draw stacked bar chart
def stacked_barplot(x, y):
    '''
    Prints a crosstab and shows a stacked bar plot of y within each level of x
    x: categorical feature
    y: binary target
    '''
    data = pd.crosstab(x, y, margins=True)
    data['% - 0'] = round(data[0] / data['All'] * 100, 2)
    data['% - 1'] = round(data[1] / data['All'] * 100, 2)
    print(data)
    print('=' * 40)
    visualize = pd.crosstab(x, y, normalize='index')
    visualize.plot(kind='bar', stacked=True, figsize=(10, 5));
# Draw box plots
def draw_boxplots(columns: list, feature: str, data=df, show_fliers=True):
    '''
    Shows a boxplot of each column in columns, grouped by feature
    columns: list of numeric column names
    feature: grouping column name
    data: a pandas dataframe (default df)
    show_fliers: whether to draw outlier points (default True)
    '''
    rows_number = math.ceil(len(columns) / 2)
    plt.figure(figsize=(15, rows_number * 5))
    for i, variable in enumerate(columns):
        plt.subplot(rows_number, 2, i + 1)
        sns.boxplot(x=data[feature], y=data[variable], palette='mako', showfliers=show_fliers)
        plt.tight_layout()
        plt.title(variable, fontsize=12)
    plt.show()
def determine_significance(feature1: str, feature2: str, data=df):
    '''
    Chi-square test of independence between feature1 and feature2
    feature1: column name
    feature2: column name
    data: dataframe to be analyzed
    '''
    crosstab = pd.crosstab(data[feature1], data[feature2])
    chi, p_value, dof, expected = stats.chi2_contingency(crosstab)
    Ho = f'{feature1} has no effect on {feature2}'
    Ha = f'{feature1} has an effect on {feature2}'
    if p_value < 0.05:
        print(f'{Ha.upper()} as the p_value ({p_value.round(3)}) < 0.05')
    else:
        print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
def show_significance(features: list, data=df):
    '''
    Prints out the significance of all the features in a list of features
    features: list of column names
    data: dataframe to be analyzed
    '''
    for feature in features:
        print('_' * 15, feature, '_' * (50 - len(feature)))
        for column in list(data.columns):
            if column != feature:
                determine_significance(column, feature)
def importance_plot(model):
    '''
    Display a feature importance barplot
    model: fitted decision tree classifier
    Note: relies on the module-level feature_names list defined before the
    tree-plotting section below.
    '''
    importances = model.feature_importances_
    indices = np.argsort(importances)
    size = len(indices) // 2
    plt.figure(figsize=(10, size))
    plt.title('Feature Importances', fontsize=14)
    plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel('Relative Importance', fontsize=12);
# generate histograms of all numeric columns to see how the data in each column skews
plt.figure(figsize=(25, 25))
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
    plt.subplot(10, 3, i + 1)
    # histplot replaces the deprecated distplot
    sns.histplot(df[variable], kde=False, color='blue')
    plt.tight_layout()
    plt.title(variable)
### Age
histogram_boxplot(df.age)
### income
histogram_boxplot(df.income)
outliers = create_outliers('income', df)
outliers.sort_values(by='income', ascending=False).head(20)
| id | age | experience | income | zip | family | cc_avg | education | mortgage | personal_loan | securities_account | cd_account | online | credit_card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3896 | 3897 | 48 | 24 | 224 | 93940 | 2 | 6.67 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| 4993 | 4994 | 45 | 21 | 218 | 91801 | 2 | 6.67 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 526 | 527 | 26 | 2 | 205 | 93106 | 1 | 6.33 | 1 | 271 | 0 | 0 | 0 | 0 | 1 |
| 2988 | 2989 | 46 | 21 | 205 | 95762 | 2 | 8.80 | 1 | 181 | 0 | 1 | 0 | 1 | 0 |
| 4225 | 4226 | 43 | 18 | 204 | 91902 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 677 | 678 | 46 | 21 | 204 | 92780 | 2 | 2.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2278 | 2279 | 30 | 4 | 204 | 91107 | 2 | 4.50 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3804 | 3805 | 47 | 22 | 203 | 95842 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2101 | 2102 | 35 | 5 | 203 | 95032 | 1 | 10.00 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 787 | 788 | 45 | 15 | 202 | 91380 | 3 | 10.00 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3608 | 3609 | 59 | 35 | 202 | 94025 | 1 | 4.70 | 1 | 553 | 0 | 0 | 0 | 0 | 0 |
| 4895 | 4896 | 45 | 20 | 201 | 92120 | 2 | 2.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2337 | 2338 | 43 | 16 | 201 | 95054 | 1 | 10.00 | 2 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2447 | 2448 | 44 | 19 | 201 | 95819 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1901 | 1902 | 43 | 19 | 201 | 94305 | 2 | 6.67 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1711 | 1712 | 27 | 3 | 201 | 95819 | 1 | 6.33 | 1 | 158 | 0 | 0 | 0 | 1 | 0 |
| 1716 | 1717 | 32 | 8 | 200 | 91330 | 2 | 6.50 | 1 | 565 | 0 | 0 | 0 | 1 | 0 |
| 459 | 460 | 35 | 10 | 200 | 91107 | 2 | 3.00 | 1 | 458 | 0 | 0 | 0 | 0 | 0 |
| 917 | 918 | 45 | 20 | 200 | 90405 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4659 | 4660 | 28 | 4 | 199 | 92121 | 1 | 6.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
print(f'There are {outliers.shape[0]} outliers')
There are 96 outliers
### cc_avg
histogram_boxplot(df.cc_avg)
outliers = create_outliers('cc_avg')
outliers.sort_values(by='cc_avg', ascending=False).head(20)
| id | age | experience | income | zip | family | cc_avg | education | mortgage | personal_loan | securities_account | cd_account | online | credit_card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2337 | 2338 | 43 | 16 | 201 | 95054 | 1 | 10.0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 |
| 787 | 788 | 45 | 15 | 202 | 91380 | 3 | 10.0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2101 | 2102 | 35 | 5 | 203 | 95032 | 1 | 10.0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3943 | 3944 | 61 | 36 | 188 | 91360 | 1 | 9.3 | 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3822 | 3823 | 63 | 33 | 178 | 91768 | 4 | 9.0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1339 | 1340 | 52 | 25 | 180 | 94545 | 2 | 9.0 | 2 | 297 | 1 | 0 | 0 | 1 | 0 |
| 9 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1277 | 1278 | 45 | 20 | 194 | 92110 | 2 | 8.8 | 1 | 428 | 0 | 0 | 0 | 0 | 0 |
| 3312 | 3313 | 47 | 22 | 190 | 94550 | 2 | 8.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4225 | 4226 | 43 | 18 | 204 | 91902 | 2 | 8.8 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2988 | 2989 | 46 | 21 | 205 | 95762 | 2 | 8.8 | 1 | 181 | 0 | 1 | 0 | 1 | 0 |
| 2447 | 2448 | 44 | 19 | 201 | 95819 | 2 | 8.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 881 | 882 | 44 | 19 | 154 | 92116 | 2 | 8.8 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 917 | 918 | 45 | 20 | 200 | 90405 | 2 | 8.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2769 | 2770 | 33 | 9 | 183 | 91320 | 2 | 8.8 | 3 | 582 | 1 | 0 | 0 | 1 | 0 |
| 3804 | 3805 | 47 | 22 | 203 | 95842 | 2 | 8.8 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1797 | 1798 | 35 | 10 | 143 | 91365 | 1 | 8.6 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4156 | 4157 | 37 | 12 | 193 | 92780 | 1 | 8.6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 614 | 615 | 37 | 12 | 180 | 90034 | 1 | 8.6 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4603 | 4604 | 37 | 12 | 179 | 91768 | 1 | 8.6 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
print(f'There are {outliers.shape[0]} outliers')
There are 324 outliers
### mortgage
histogram_boxplot(df.mortgage)
outliers = create_outliers('mortgage')
outliers.sort_values(by='mortgage', ascending=False).head(20)
| id | age | experience | income | zip | family | cc_avg | education | mortgage | personal_loan | securities_account | cd_account | online | credit_card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2934 | 2935 | 37 | 13 | 195 | 91763 | 2 | 6.5 | 1 | 635 | 0 | 0 | 0 | 1 | 0 |
| 303 | 304 | 49 | 25 | 195 | 95605 | 4 | 3.0 | 1 | 617 | 1 | 0 | 0 | 0 | 0 |
| 4812 | 4813 | 29 | 4 | 184 | 92126 | 4 | 2.2 | 3 | 612 | 1 | 0 | 0 | 1 | 0 |
| 1783 | 1784 | 53 | 27 | 192 | 94720 | 1 | 1.7 | 1 | 601 | 0 | 0 | 0 | 1 | 0 |
| 4842 | 4843 | 49 | 23 | 174 | 95449 | 3 | 4.6 | 2 | 590 | 1 | 0 | 0 | 0 | 0 |
| 1937 | 1938 | 51 | 25 | 181 | 95051 | 1 | 3.3 | 3 | 589 | 1 | 1 | 1 | 1 | 0 |
| 782 | 783 | 54 | 30 | 194 | 92056 | 3 | 6.0 | 3 | 587 | 1 | 1 | 1 | 1 | 1 |
| 2769 | 2770 | 33 | 9 | 183 | 91320 | 2 | 8.8 | 3 | 582 | 1 | 0 | 0 | 1 | 0 |
| 4655 | 4656 | 33 | 7 | 188 | 95054 | 2 | 7.0 | 2 | 581 | 1 | 0 | 0 | 0 | 0 |
| 4345 | 4346 | 26 | 1 | 184 | 94608 | 2 | 4.2 | 3 | 577 | 1 | 0 | 1 | 1 | 1 |
| 4585 | 4586 | 35 | 11 | 180 | 94010 | 1 | 3.6 | 3 | 571 | 1 | 0 | 1 | 1 | 1 |
| 2541 | 2542 | 34 | 8 | 171 | 90212 | 2 | 2.2 | 2 | 569 | 1 | 0 | 0 | 1 | 0 |
| 1789 | 1790 | 44 | 20 | 171 | 91330 | 4 | 0.7 | 1 | 567 | 1 | 0 | 1 | 1 | 1 |
| 2841 | 2842 | 37 | 11 | 190 | 94305 | 4 | 7.3 | 2 | 565 | 1 | 0 | 1 | 1 | 0 |
| 1716 | 1717 | 32 | 8 | 200 | 91330 | 2 | 6.5 | 1 | 565 | 0 | 0 | 0 | 1 | 0 |
| 3608 | 3609 | 59 | 35 | 202 | 94025 | 1 | 4.7 | 1 | 553 | 0 | 0 | 0 | 0 | 0 |
| 4672 | 4673 | 52 | 26 | 180 | 95831 | 1 | 1.7 | 1 | 550 | 0 | 0 | 0 | 1 | 0 |
| 473 | 474 | 64 | 39 | 182 | 93955 | 1 | 1.2 | 2 | 547 | 1 | 0 | 0 | 1 | 0 |
| 4859 | 4860 | 34 | 8 | 165 | 91107 | 1 | 7.0 | 3 | 541 | 1 | 0 | 0 | 0 | 0 |
| 2041 | 2042 | 45 | 20 | 180 | 95403 | 3 | 8.5 | 2 | 535 | 1 | 0 | 0 | 0 | 0 |
print(f'There are {outliers.shape[0]} outliers')
There are 291 outliers
### Check zero values in mortgage column
print(f'the mortgage column has {df[df.mortgage==0].shape[0]} rows where the value = zero')
print(f'This is {(df[df.mortgage==0].shape[0] / df.shape[0]) * 100:.2f}% of the rows in the dataset')
the mortgage column has 3462 rows where the value = zero This is 69.24% of the rows in the dataset
### Plot the frequency of zero mortgage by zip
plt.figure(figsize=(30,20))
sns.countplot(
y=df[df.mortgage==0]['zip'],
data=df,
order=df[df.mortgage==0]['zip'].value_counts().index[:40]
);
### experience
histogram_boxplot(df.experience)
plt.figure(figsize=(30,20))
sns.countplot(y=df.experience,
data=df,
order=df.experience.value_counts().index[:]);
print(f'There are {df[df.experience<0].shape[0]} rows that have experience less than zero')
df[df.experience<0].sort_values(by='experience', ascending=True).head()
There are 52 rows that have experience less than zero
| id | age | experience | income | zip | family | cc_avg | education | mortgage | personal_loan | securities_account | cd_account | online | credit_card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4514 | 4515 | 24 | -3 | 41 | 91768 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2618 | 2619 | 23 | -3 | 55 | 92704 | 3 | 2.4 | 2 | 145 | 0 | 0 | 0 | 1 | 0 |
| 4285 | 4286 | 23 | -3 | 149 | 93555 | 2 | 7.2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3626 | 3627 | 24 | -3 | 28 | 90089 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2717 | 2718 | 23 | -2 | 45 | 95422 | 4 | 0.6 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
### Visualize experience < 0 with age
plt.figure(figsize=(20,8))
sns.countplot(y=df[df.experience<0]['age'],
data=df,
order=df[df.experience<0]['age'].value_counts().index[:]);
Negative experience values will be replaced with their absolute value, turning each negative into the corresponding positive number
df['experience'] = np.abs(df.experience)
df.sort_values(by='experience', ascending=True).head(10)
| id | age | experience | income | zip | family | cc_avg | education | mortgage | personal_loan | securities_account | cd_account | online | credit_card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2756 | 2757 | 27 | 0 | 40 | 91301 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2009 | 2010 | 25 | 0 | 99 | 92735 | 1 | 1.9 | 1 | 323 | 0 | 0 | 0 | 0 | 0 |
| 4393 | 4394 | 24 | 0 | 59 | 95521 | 4 | 1.6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 347 | 348 | 25 | 0 | 43 | 94305 | 2 | 1.6 | 3 | 0 | 0 | 1 | 1 | 1 | 1 |
| 4425 | 4426 | 26 | 0 | 164 | 95973 | 2 | 4.0 | 3 | 301 | 1 | 0 | 0 | 1 | 0 |
| 1847 | 1848 | 25 | 0 | 52 | 95126 | 3 | 2.6 | 3 | 159 | 0 | 0 | 0 | 0 | 0 |
| 1765 | 1766 | 26 | 0 | 149 | 95051 | 2 | 7.2 | 1 | 154 | 0 | 0 | 0 | 0 | 0 |
| 363 | 364 | 25 | 0 | 30 | 92691 | 2 | 1.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1732 | 1733 | 25 | 0 | 88 | 94566 | 2 | 1.8 | 2 | 319 | 0 | 0 | 0 | 1 | 1 |
| 3908 | 3909 | 24 | 0 | 44 | 90638 | 3 | 0.1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
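The absolute-value conversion can be sanity-checked with a quick sketch (a toy series stands in for df.experience so the sketch is self-contained):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df.experience: np.abs maps each negative entry to its
# positive counterpart, so no negative values remain.
experience = pd.Series([-3, -2, 0, 5, 20])
fixed = np.abs(experience)
print(fixed.tolist())  # [3, 2, 0, 5, 20]
```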
histogram_boxplot(df.experience)
plt.figure(figsize=(15,10))
sns.countplot(y=df.experience,
data=df,
order=df.experience.value_counts().index[:]);
features = ['age', 'experience', 'income', 'cc_avg', 'mortgage', 'zip']
n_rows = math.ceil(len(features)/3)
plt.figure(figsize=(15, n_rows*3.5))
for i, feature in enumerate(list(features)):
    plt.subplot(n_rows, 3, i + 1)
    plt.hist(df[feature])
    plt.tight_layout()
    plt.title(feature, fontsize=15);
### Outliers in numerical columns
plt.figure(figsize=(15, n_rows*4))
for i, feature in enumerate(features):
    plt.subplot(n_rows, 3, i + 1)
    plt.boxplot(df[feature], whis=1.5)
    plt.tight_layout()
    plt.title(feature, fontsize=15);
### Value Counts of non-numerical columns
display_number = 20
for colname in df.dtypes[df.dtypes == 'category'].index:
    val_counts = df[colname].value_counts(dropna=False)
    print(f'Column: {colname}')
    print('-' * 20)
    print(val_counts[:display_number])
    if len(val_counts) > display_number:
        print(f'This is only the first {display_number} of {len(val_counts)} total')
    print('\n')
Column: family -------------------- 1 1472 2 1296 4 1222 3 1010 Name: family, dtype: int64 Column: education -------------------- 1 2096 3 1501 2 1403 Name: education, dtype: int64
### A look at zip codes
plt.figure(figsize=(15,15))
sns.countplot(y='zip', data=df, order=df.zip.value_counts().index[0:50]);
### A look at the family data
plt.figure(figsize=(15,7))
ax = sns.countplot(x=df.family, palette='mako')
percent_bar(ax, df.family)
### A look at education
plt.figure(figsize=(15,7))
ax = sns.countplot(x=df.education, palette='mako')
percent_bar(ax, df.education)
### A look at personal_loan
plt.figure(figsize=(15,7))
ax = sns.countplot(x=df.personal_loan, palette='mako')
percent_bar(ax, df.personal_loan)
### A look at securities_account
plt.figure(figsize=(15,7))
ax = sns.countplot(x=df.securities_account, palette='mako')
percent_bar(ax, df.securities_account)
### A look at cd_account
plt.figure(figsize=(15,7))
ax = sns.countplot(x=df.cd_account, palette='mako')
percent_bar(ax, df.cd_account)
### A look at online
plt.figure(figsize=(15,7))
ax = sns.countplot(x=df.online, palette='mako')
percent_bar(ax, df.online)
### A look at credit_card
plt.figure(figsize=(15,7))
ax = sns.countplot(x=df.credit_card, palette='mako')
percent_bar(ax, df.credit_card)
plt.figure(figsize=(12,12))
# numeric_only=True skips the two category columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='YlGnBu')
plt.show()
sns.pairplot(data=df[['age','income','zip','cc_avg','mortgage','experience','personal_loan']],
hue='personal_loan', diag_kind='kde')
plt.show()
### Boxplots without outliers
columns = ['age','income','cc_avg','mortgage','experience']
draw_boxplots(columns, 'personal_loan', show_fliers=False);
### Personal Loan vs Family Size
stacked_barplot(df.family, df.personal_loan)
personal_loan 0 1 All % - 0 % - 1 family 1 1365 107 1472 92.73 7.27 2 1190 106 1296 91.82 8.18 3 877 133 1010 86.83 13.17 4 1088 134 1222 89.03 10.97 All 4520 480 5000 90.40 9.60 ========================================
### Personal Loan vs Education
stacked_barplot(df.education, df.personal_loan)
personal_loan 0 1 All % - 0 % - 1 education 1 2003 93 2096 95.56 4.44 2 1221 182 1403 87.03 12.97 3 1296 205 1501 86.34 13.66 All 4520 480 5000 90.40 9.60 ========================================
### Personal Loan vs Securities Account
stacked_barplot(df.securities_account, df.personal_loan)
personal_loan 0 1 All % - 0 % - 1 securities_account 0 4058 420 4478 90.62 9.38 1 462 60 522 88.51 11.49 All 4520 480 5000 90.40 9.60 ========================================
### Personal Loan vs Certificate of Deposit Account
stacked_barplot(df.cd_account, df.personal_loan)
personal_loan 0 1 All % - 0 % - 1 cd_account 0 4358 340 4698 92.76 7.24 1 162 140 302 53.64 46.36 All 4520 480 5000 90.40 9.60 ========================================
### Personal Loan vs Online
stacked_barplot(df.online, df.personal_loan)
personal_loan 0 1 All % - 0 % - 1 online 0 1827 189 2016 90.62 9.38 1 2693 291 2984 90.25 9.75 All 4520 480 5000 90.40 9.60 ========================================
### Personal Loan vs credit_card
stacked_barplot(df.credit_card, df.personal_loan)
personal_loan 0 1 All % - 0 % - 1 credit_card 0 3193 337 3530 90.45 9.55 1 1327 143 1470 90.27 9.73 All 4520 480 5000 90.40 9.60 ========================================
### What features are statistically significant
show_significance(['personal_loan', 'cd_account'], df)
_______________ personal_loan _____________________________________ id has no effect on personal_loan as the p_value (0.493) > 0.05 age has no effect on personal_loan as the p_value (0.12) > 0.05 experience has no effect on personal_loan as the p_value (0.805) > 0.05 INCOME HAS AN EFFECT ON PERSONAL_LOAN as the p_value (0.0) < 0.05 zip has no effect on personal_loan as the p_value (0.76) > 0.05 FAMILY HAS AN EFFECT ON PERSONAL_LOAN as the p_value (0.0) < 0.05 CC_AVG HAS AN EFFECT ON PERSONAL_LOAN as the p_value (0.0) < 0.05 EDUCATION HAS AN EFFECT ON PERSONAL_LOAN as the p_value (0.0) < 0.05 MORTGAGE HAS AN EFFECT ON PERSONAL_LOAN as the p_value (0.0) < 0.05 securities_account has no effect on personal_loan as the p_value (0.141) > 0.05 CD_ACCOUNT HAS AN EFFECT ON PERSONAL_LOAN as the p_value (0.0) < 0.05 online has no effect on personal_loan as the p_value (0.693) > 0.05 credit_card has no effect on personal_loan as the p_value (0.884) > 0.05 _______________ cd_account ________________________________________ id has no effect on cd_account as the p_value (0.493) > 0.05 AGE HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.027) < 0.05 experience has no effect on cd_account as the p_value (0.072) > 0.05 INCOME HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.0) < 0.05 zip has no effect on cd_account as the p_value (0.675) > 0.05 FAMILY HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.018) < 0.05 CC_AVG HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.0) < 0.05 education has no effect on cd_account as the p_value (0.58) > 0.05 MORTGAGE HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.0) < 0.05 PERSONAL_LOAN HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.0) < 0.05 SECURITIES_ACCOUNT HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.0) < 0.05 ONLINE HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.0) < 0.05 CREDIT_CARD HAS AN EFFECT ON CD_ACCOUNT as the p_value (0.0) < 0.05
### Prepare the Data
#### get_dummies for education and family
df_dummies = pd.get_dummies(df, columns=['education', 'family'], drop_first=True)
df_dummies.head()
| id | age | experience | income | zip | cc_avg | mortgage | personal_loan | securities_account | cd_account | online | credit_card | education_2 | education_3 | family_2 | family_3 | family_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 1.6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 1.5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 2.7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 1.0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
df_dummies.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 5000 non-null int64 1 age 5000 non-null int64 2 experience 5000 non-null int64 3 income 5000 non-null int64 4 zip 5000 non-null int64 5 cc_avg 5000 non-null float64 6 mortgage 5000 non-null int64 7 personal_loan 5000 non-null int64 8 securities_account 5000 non-null int64 9 cd_account 5000 non-null int64 10 online 5000 non-null int64 11 credit_card 5000 non-null int64 12 education_2 5000 non-null uint8 13 education_3 5000 non-null uint8 14 family_2 5000 non-null uint8 15 family_3 5000 non-null uint8 16 family_4 5000 non-null uint8 dtypes: float64(1), int64(11), uint8(5) memory usage: 493.3 KB
### Partition the data to prepare for building the model
X = df_dummies.drop(['personal_loan'], axis=1)
y = df_dummies['personal_loan']
### Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(f'The shape of X_train: {X_train.shape}')
print(f'The shape of X_test: {X_test.shape}')
The shape of X_train: (3500, 16) The shape of X_test: (1500, 16)
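With only ~9.6% of customers accepting the loan, passing stratify=y to train_test_split would keep the class ratio nearly identical across both sets. A self-contained sketch (X_demo and y_demo are synthetic stand-ins mirroring the 480/4520 split above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the X / y defined above (480 positives of 5000).
X_demo = np.arange(5000).reshape(-1, 1)
y_demo = np.array([1] * 480 + [0] * 4520)
# stratify=y_demo preserves the ~9.6% positive rate in train and test alike
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)
print(f'train positives: {y_tr.mean():.3f}, test positives: {y_te.mean():.3f}')
```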
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, x_test=X_test):
    '''
    model: classifier used to predict values of x_test
    y_actual: ground truth
    x_test: feature matrix to predict on (default X_test)
    '''
    y_predict = model.predict(x_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=['Actual - No', 'Actual - Yes'],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = [f'{value:0.0f}' for value in cm.flatten()]
    group_percentages = [f'{value:.2%}' for value in cm.flatten() / np.sum(cm)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label', fontsize=14)
    plt.xlabel('Predicted label', fontsize=14)
## Function to calculate recall score
def get_recall_score(model):
    '''
    model: classifier used to predict on X_train and X_test
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print('Recall on training set : ', metrics.recall_score(y_train, pred_train))
    print('Recall on test set : ', metrics.recall_score(y_test, pred_test))
### Initial Decision Tree
# use gini
# use a class_weight of 0:0.15, 1:0.85
# random_state=1
model = DecisionTreeClassifier(criterion='gini',
                               class_weight={0: 0.15, 1: 0.85},
                               random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
make_confusion_matrix(model, y_test)
y_train.value_counts(normalize=True)
0 0.905429 1 0.094571 Name: personal_loan, dtype: float64
# get recall on train and test
get_recall_score(model)
Recall on training set :  1.0
Recall on test set :  0.8791946308724832
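The `class_weight={0: 0.15, 1: 0.85}` choice counteracts the ~9.5% positive rate by making each missed positive cost more during training. A small self-contained demonstration of the effect on synthetic imbalanced data (a hypothetical stand-in, not the bank set):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with ~10% positives, standing in for the bank set
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

recalls = {}
for cw in (None, {0: 0.15, 1: 0.85}):
    clf = DecisionTreeClassifier(class_weight=cw, max_depth=3, random_state=1)
    clf.fit(X_tr, y_tr)
    recalls['weighted' if cw else 'unweighted'] = recall_score(y_te, clf.predict(X_te))
print(recalls)  # weighting toward the minority class typically raises its recall
```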
feature_names = list(X.columns)
plt.figure(figsize=(20, 30))
out = tree.plot_tree(model,
                     feature_names=feature_names,
                     filled=True,
                     fontsize=9,
                     node_ids=False,
                     class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- income <= 98.50
|   |--- cc_avg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- cc_avg > 2.95
|   |   |--- cd_account <= 0.50
|   |   |   |--- cc_avg <= 3.95
|   |   |   |   |--- income <= 81.50
|   |   |   |   |   |--- age <= 36.50
|   |   |   |   |   |   |--- family_4 <= 0.50
|   |   |   |   |   |   |   |--- cc_avg <= 3.50
|   |   |   |   |   |   |   |   |--- family_3 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |--- family_3 > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- cc_avg > 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- family_4 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- age > 36.50
|   |   |   |   |   |   |--- zip <= 91269.00
|   |   |   |   |   |   |   |--- id <= 1184.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- id > 1184.50
|   |   |   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |   |   |--- zip > 91269.00
|   |   |   |   |   |   |   |--- mortgage <= 54.00
|   |   |   |   |   |   |   |   |--- weights: [4.05, 0.00] class: 0
|   |   |   |   |   |   |   |--- mortgage > 54.00
|   |   |   |   |   |   |   |   |--- weights: [1.50, 0.00] class: 0
|   |   |   |   |--- income > 81.50
|   |   |   |   |   |--- id <= 934.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- id > 934.50
|   |   |   |   |   |   |--- zip <= 95084.00
|   |   |   |   |   |   |   |--- cc_avg <= 3.05
|   |   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |   |--- cc_avg > 3.05
|   |   |   |   |   |   |   |   |--- mortgage <= 173.00
|   |   |   |   |   |   |   |   |   |--- id <= 3334.00
|   |   |   |   |   |   |   |   |   |   |--- id <= 1925.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- id > 1925.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- id > 3334.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.95] class: 1
|   |   |   |   |   |   |   |   |--- mortgage > 173.00
|   |   |   |   |   |   |   |   |   |--- online <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- online > 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- zip > 95084.00
|   |   |   |   |   |   |   |--- mortgage <= 56.00
|   |   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |   |--- mortgage > 56.00
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |--- cc_avg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- cd_account > 0.50
|   |   |   |--- id <= 766.50
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- id > 766.50
|   |   |   |   |--- weights: [0.00, 6.80] class: 1
|--- income > 98.50
|   |--- education_3 <= 0.50
|   |   |--- education_2 <= 0.50
|   |   |   |--- family_3 <= 0.50
|   |   |   |   |--- family_4 <= 0.50
|   |   |   |   |   |--- income <= 100.00
|   |   |   |   |   |   |--- zip <= 91169.00
|   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |--- zip > 91169.00
|   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- income > 100.00
|   |   |   |   |   |   |--- income <= 103.50
|   |   |   |   |   |   |   |--- securities_account <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |   |   |--- securities_account > 0.50
|   |   |   |   |   |   |   |   |--- cc_avg <= 3.06
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- cc_avg > 3.06
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- income > 103.50
|   |   |   |   |   |   |   |--- weights: [64.95, 0.00] class: 0
|   |   |   |   |--- family_4 > 0.50
|   |   |   |   |   |--- income <= 102.00
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- income > 102.00
|   |   |   |   |   |   |--- age <= 34.00
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- age > 34.00
|   |   |   |   |   |   |   |--- weights: [0.00, 15.30] class: 1
|   |   |   |--- family_3 > 0.50
|   |   |   |   |--- income <= 108.50
|   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |--- income > 108.50
|   |   |   |   |   |--- zip <= 90019.50
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- zip > 90019.50
|   |   |   |   |   |   |--- age <= 26.00
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- age > 26.00
|   |   |   |   |   |   |   |--- income <= 118.00
|   |   |   |   |   |   |   |   |--- id <= 2808.00
|   |   |   |   |   |   |   |   |   |--- age <= 38.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |   |--- age > 38.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |--- id > 2808.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- income > 118.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 28.05] class: 1
|   |   |--- education_2 > 0.50
|   |   |   |--- income <= 110.50
|   |   |   |   |--- cc_avg <= 3.54
|   |   |   |   |   |--- income <= 106.50
|   |   |   |   |   |   |--- income <= 99.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |--- income > 99.50
|   |   |   |   |   |   |   |--- weights: [3.30, 0.00] class: 0
|   |   |   |   |   |--- income > 106.50
|   |   |   |   |   |   |--- experience <= 27.00
|   |   |   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |   |   |   |   |--- experience > 27.00
|   |   |   |   |   |   |   |--- cc_avg <= 1.85
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |--- cc_avg > 1.85
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- cc_avg > 3.54
|   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |--- income > 110.50
|   |   |   |   |--- income <= 116.50
|   |   |   |   |   |--- mortgage <= 141.50
|   |   |   |   |   |   |--- experience <= 35.50
|   |   |   |   |   |   |   |--- cc_avg <= 1.20
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |--- cc_avg > 1.20
|   |   |   |   |   |   |   |   |--- zip <= 94887.00
|   |   |   |   |   |   |   |   |   |--- cc_avg <= 2.65
|   |   |   |   |   |   |   |   |   |   |--- experience <= 15.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |   |   |   |--- experience > 15.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- cc_avg > 2.65
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.25] class: 1
|   |   |   |   |   |   |   |   |--- zip > 94887.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- experience > 35.50
|   |   |   |   |   |   |   |--- zip <= 92175.00
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- zip > 92175.00
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |--- mortgage > 141.50
|   |   |   |   |   |   |--- online <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |--- online > 0.50
|   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |--- income > 116.50
|   |   |   |   |   |--- weights: [0.00, 91.80] class: 1
|   |--- education_3 > 0.50
|   |   |--- income <= 116.50
|   |   |   |--- cc_avg <= 2.45
|   |   |   |   |--- age <= 41.50
|   |   |   |   |   |--- family_2 <= 0.50
|   |   |   |   |   |   |--- weights: [3.15, 0.00] class: 0
|   |   |   |   |   |--- family_2 > 0.50
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- age > 41.50
|   |   |   |   |   |--- experience <= 31.50
|   |   |   |   |   |   |--- online <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |--- online > 0.50
|   |   |   |   |   |   |   |--- zip <= 93596.00
|   |   |   |   |   |   |   |   |--- id <= 1274.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |--- id > 1274.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |--- zip > 93596.00
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- experience > 31.50
|   |   |   |   |   |   |--- income <= 102.50
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- income > 102.50
|   |   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |--- cc_avg > 2.45
|   |   |   |   |--- zip <= 90389.50
|   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- zip > 90389.50
|   |   |   |   |   |--- id <= 4852.50
|   |   |   |   |   |   |--- id <= 4505.50
|   |   |   |   |   |   |   |--- cd_account <= 0.50
|   |   |   |   |   |   |   |   |--- income <= 99.50
|   |   |   |   |   |   |   |   |   |--- cc_avg <= 4.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |   |   |--- cc_avg > 4.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- income > 99.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 10.20] class: 1
|   |   |   |   |   |   |   |--- cd_account > 0.50
|   |   |   |   |   |   |   |   |--- cc_avg <= 4.70
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- cc_avg > 4.70
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- id > 4505.50
|   |   |   |   |   |   |   |--- id <= 4731.50
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |--- id > 4731.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |--- id > 4852.50
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |--- income > 116.50
|   |   |   |--- cd_account <= 0.50
|   |   |   |   |--- weights: [0.00, 66.30] class: 1
|   |   |   |--- cd_account > 0.50
|   |   |   |   |--- weights: [0.00, 30.60] class: 1
importance_plot(model=model)
pd.DataFrame(model.feature_importances_,
columns=['Imp'],
index=X_train.columns).sort_values(by='Imp', ascending=False)
| | Imp |
|---|---|
| income | 5.917204e-01 |
| education_2 | 8.813411e-02 |
| cc_avg | 8.307066e-02 |
| family_4 | 7.178685e-02 |
| family_3 | 7.032437e-02 |
| education_3 | 3.514783e-02 |
| id | 1.434689e-02 |
| cd_account | 1.109840e-02 |
| zip | 9.899004e-03 |
| experience | 8.495228e-03 |
| age | 7.795126e-03 |
| mortgage | 3.465266e-03 |
| securities_account | 2.769228e-03 |
| online | 1.946623e-03 |
| family_2 | 2.902121e-17 |
| credit_card | 0.000000e+00 |
# Create an estimator using the DecisionTreeClassifier
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
# Parameters to search over
parameters = {'max_depth': np.arange(1, 10),
              'criterion': ['entropy', 'gini'],
              'min_impurity_decrease': [0.000001, 0.00001, 0.0001, 0.001, 0.01],
              'max_features': ['log2', 'sqrt']}
# define the scoring metric used
scorer = metrics.make_scorer(metrics.recall_score)
# Grid Search
grid_obj = GridSearchCV(estimator, param_grid=parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# set the estimator to be the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the training data
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=3,
max_features='log2', min_impurity_decrease=1e-06,
random_state=1)
make_confusion_matrix(estimator, y_test)
get_recall_score(estimator)
Recall on training set :  0.8882175226586103
Recall on test set :  0.8053691275167785
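`GridSearchCV` refits every parameter combination with 5-fold cross-validation and keeps the one with the best mean recall. A self-contained miniature of the same pattern on hypothetical toy data (the real search above runs on `X_train`/`y_train`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data; stands in for the bank set
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=1)

# scoring='recall' is shorthand for make_scorer(metrics.recall_score)
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid={'max_depth': [2, 4, 6]},
                    scoring='recall', cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the depth that maximized cross-validated recall
```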
plt.figure(figsize=(20, 30))
out = tree.plot_tree(estimator,
                     feature_names=feature_names,
                     filled=True,
                     fontsize=10,
                     node_ids=True,
                     class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
print(tree.export_text(estimator,
feature_names=feature_names,
show_weights=True))
|--- education_3 <= 0.50
|   |--- cd_account <= 0.50
|   |   |--- cc_avg <= 3.05
|   |   |   |--- weights: [279.30, 31.45] class: 0
|   |   |--- cc_avg > 3.05
|   |   |   |--- weights: [45.45, 77.35] class: 1
|   |--- cd_account > 0.50
|   |   |--- education_2 <= 0.50
|   |   |   |--- weights: [9.15, 19.55] class: 1
|   |   |--- education_2 > 0.50
|   |   |   |--- weights: [3.60, 35.70] class: 1
|--- education_3 > 0.50
|   |--- cc_avg <= 2.55
|   |   |--- income <= 109.50
|   |   |   |--- weights: [121.95, 0.00] class: 0
|   |   |--- income > 109.50
|   |   |   |--- weights: [3.45, 23.80] class: 1
|   |--- cc_avg > 2.55
|   |   |--- cc_avg <= 4.55
|   |   |   |--- weights: [12.45, 46.75] class: 1
|   |   |--- cc_avg > 4.55
|   |   |   |--- weights: [0.00, 46.75] class: 1
pd.DataFrame(estimator.feature_importances_,
columns=['Imp'],
index=X_train.columns).sort_values(by='Imp', ascending=False)
| | Imp |
|---|---|
| cc_avg | 0.596320 |
| cd_account | 0.190115 |
| income | 0.174264 |
| education_3 | 0.030519 |
| education_2 | 0.008783 |
| id | 0.000000 |
| age | 0.000000 |
| experience | 0.000000 |
| zip | 0.000000 |
| mortgage | 0.000000 |
| securities_account | 0.000000 |
| online | 0.000000 |
| credit_card | 0.000000 |
| family_2 | 0.000000 |
| family_3 | 0.000000 |
| family_4 | 0.000000 |
importance_plot(model=estimator)
clf = DecisionTreeClassifier(random_state=1, class_weight = {0:0.15, 1:0.85})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | -6.850004e-15 |
| 1 | 1.320471e-19 | -6.849872e-15 |
| 2 | 7.482671e-19 | -6.849124e-15 |
| 3 | 7.482671e-19 | -6.848376e-15 |
| 4 | 7.482671e-19 | -6.847627e-15 |
| 5 | 1.628581e-18 | -6.845999e-15 |
| 6 | 2.288817e-18 | -6.843710e-15 |
| 7 | 2.332833e-18 | -6.841377e-15 |
| 8 | 2.420864e-18 | -6.838956e-15 |
| 9 | 2.905037e-18 | -6.836051e-15 |
| 10 | 3.521257e-18 | -6.832530e-15 |
| 11 | 4.665666e-18 | -6.827864e-15 |
| 12 | 6.998499e-18 | -6.820866e-15 |
| 13 | 9.478050e-18 | -6.811388e-15 |
| 14 | 1.003558e-17 | -6.801352e-15 |
| 15 | 4.407293e-16 | -6.360623e-15 |
| 16 | 1.936722e-04 | 7.746886e-04 |
| 17 | 1.972347e-04 | 1.169158e-03 |
| 18 | 3.369896e-04 | 1.506148e-03 |
| 19 | 3.641037e-04 | 2.962562e-03 |
| 20 | 3.643130e-04 | 3.326875e-03 |
| 21 | 3.685823e-04 | 4.432622e-03 |
| 22 | 3.744328e-04 | 4.807055e-03 |
| 23 | 3.797065e-04 | 5.186762e-03 |
| 24 | 3.879017e-04 | 5.574663e-03 |
| 25 | 3.885915e-04 | 6.351846e-03 |
| 26 | 3.928099e-04 | 6.744656e-03 |
| 27 | 5.860688e-04 | 7.330725e-03 |
| 28 | 6.546462e-04 | 7.985371e-03 |
| 29 | 6.554717e-04 | 8.640843e-03 |
| 30 | 6.706505e-04 | 9.311494e-03 |
| 31 | 6.758139e-04 | 9.987307e-03 |
| 32 | 7.887090e-04 | 1.235343e-02 |
| 33 | 8.789656e-04 | 1.323240e-02 |
| 34 | 9.093369e-04 | 1.414174e-02 |
| 35 | 9.404360e-04 | 1.508217e-02 |
| 36 | 9.407728e-04 | 1.696372e-02 |
| 37 | 9.951370e-04 | 1.895399e-02 |
| 38 | 1.011155e-03 | 1.996515e-02 |
| 39 | 1.013173e-03 | 2.097832e-02 |
| 40 | 1.018946e-03 | 2.199727e-02 |
| 41 | 1.115952e-03 | 2.311322e-02 |
| 42 | 1.470617e-03 | 2.458383e-02 |
| 43 | 1.638043e-03 | 2.622188e-02 |
| 44 | 1.686407e-03 | 2.959469e-02 |
| 45 | 1.843638e-03 | 3.143833e-02 |
| 46 | 2.602631e-03 | 3.404096e-02 |
| 47 | 2.742431e-03 | 3.678339e-02 |
| 48 | 3.335999e-03 | 4.011939e-02 |
| 49 | 3.409906e-03 | 4.352930e-02 |
| 50 | 3.527226e-03 | 4.705652e-02 |
| 51 | 4.797122e-03 | 5.665076e-02 |
| 52 | 5.138280e-03 | 6.178904e-02 |
| 53 | 6.725814e-03 | 6.851486e-02 |
| 54 | 2.253222e-02 | 9.104708e-02 |
| 55 | 3.057320e-02 | 2.133399e-01 |
| 56 | 2.537957e-01 | 4.671356e-01 |
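The pruning path pairs each effective alpha with the total leaf impurity of the corresponding pruned subtree: raising `ccp_alpha` removes the weakest links first, so trees can only shrink. A minimal sanity check of that monotone relationship on synthetic data (not the bank set):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)

# Node counts are non-increasing as the pruning penalty ccp_alpha grows
sizes = [DecisionTreeClassifier(random_state=1, ccp_alpha=a)
         .fit(X, y).tree_.node_count
         for a in (0.0, 0.01, 0.05)]
print(sizes)
```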
fig, ax = plt.subplots(figsize=(15,7))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle='steps-post')
ax.set_xlabel('Effective alpha')
ax.set_ylabel('Total impurity of leaves')
ax.set_title('Total impurity vs Effective alpha for training set')
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1,
                                 ccp_alpha=ccp_alpha,
                                 class_weight={0: 0.15, 1: 0.85})
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(f'Number of nodes in the last tree is: {clfs[-1].tree_.node_count} with ccp_alpha: {ccp_alphas[-1]}')
Number of nodes in the last tree is: 1 with ccp_alpha: 0.25379571489480973
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(15,10), sharex=True)
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle='steps-post')
ax[0].set_ylabel('Number of nodes')
ax[0].set_title('Number of nodes vs alpha')
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle='steps-post')
ax[1].set_xlabel('alpha')
ax[1].set_ylabel('depth of tree')
ax[1].set_title('Depth vs alpha')
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15,7))
ax.set_xlabel('alpha')
ax.set_ylabel('Recall')
ax.set_title('Recall vs alpha for training and testing sets')
ax.plot(ccp_alphas,
recall_train,
marker='o',
label='train',
drawstyle='steps-post',)
ax.plot(ccp_alphas,
recall_test,
marker='o',
label='test',
drawstyle='steps-post')
ax.legend()
plt.show()
### Select the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.006725813690406909,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.006725813690406909,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
make_confusion_matrix(best_model, y_test)
get_recall_score(best_model)
Recall on training set :  0.9909365558912386
Recall on test set :  0.9865771812080537
plt.figure(figsize=(20, 8))
out = tree.plot_tree(best_model,
                     feature_names=feature_names,
                     filled=True,
                     fontsize=12,
                     node_ids=True,
                     class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- income <= 98.50
|   |--- cc_avg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- cc_avg > 2.95
|   |   |--- weights: [18.60, 18.70] class: 1
|--- income > 98.50
|   |--- education_3 <= 0.50
|   |   |--- education_2 <= 0.50
|   |   |   |--- family_3 <= 0.50
|   |   |   |   |--- family_4 <= 0.50
|   |   |   |   |   |--- weights: [67.65, 2.55] class: 0
|   |   |   |   |--- family_4 > 0.50
|   |   |   |   |   |--- weights: [0.15, 16.15] class: 1
|   |   |   |--- family_3 > 0.50
|   |   |   |   |--- weights: [1.50, 29.75] class: 1
|   |   |--- education_2 > 0.50
|   |   |   |--- weights: [6.75, 101.15] class: 1
|   |--- education_3 > 0.50
|   |   |--- weights: [6.60, 113.05] class: 1
importance_plot(model=best_model)
best_model2 = DecisionTreeClassifier(ccp_alpha=0.01,
class_weight={0:0.15, 1:0.85},
random_state=1)
best_model2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.01, class_weight={0: 0.15, 1: 0.85},
random_state=1)
make_confusion_matrix(best_model2, y_test)
get_recall_score(best_model2)
Recall on training set :  0.9909365558912386
Recall on test set :  0.9865771812080537
plt.figure(figsize=(20, 8))
out = tree.plot_tree(best_model2,
                     feature_names=feature_names,
                     filled=True,
                     fontsize=12,
                     node_ids=True,
                     class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
print(tree.export_text(best_model2, feature_names=feature_names, show_weights=True))
|--- income <= 98.50
|   |--- cc_avg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- cc_avg > 2.95
|   |   |--- weights: [18.60, 18.70] class: 1
|--- income > 98.50
|   |--- education_3 <= 0.50
|   |   |--- education_2 <= 0.50
|   |   |   |--- family_3 <= 0.50
|   |   |   |   |--- family_4 <= 0.50
|   |   |   |   |   |--- weights: [67.65, 2.55] class: 0
|   |   |   |   |--- family_4 > 0.50
|   |   |   |   |   |--- weights: [0.15, 16.15] class: 1
|   |   |   |--- family_3 > 0.50
|   |   |   |   |--- weights: [1.50, 29.75] class: 1
|   |   |--- education_2 > 0.50
|   |   |   |--- weights: [6.75, 101.15] class: 1
|   |--- education_3 > 0.50
|   |   |--- weights: [6.60, 113.05] class: 1
importance_plot(model=best_model2)
comparison_frame = pd.DataFrame({'Model': ['Initial decision tree model',
                                           'Decision tree with hyperparameter tuning',
                                           'Decision tree with post-pruning'],
                                 'Train_Recall': [1.00, 0.89, 0.99],
                                 'Test_Recall': [0.88, 0.81, 0.99]})
comparison_frame
| | Model | Train_Recall | Test_Recall |
|---|---|---|---|
| 0 | Initial decision tree model | 1.00 | 0.88 |
| 1 | Decision tree with hyperparameter tuning | 0.89 | 0.81 |
| 2 | Decision tree with post-pruning | 0.99 | 0.99 |